Metadata extraction of non-transcribed video and audio streams

ABSTRACT

A system and computer-based method for transcribing and extracting metadata from a source media. A processor-based server extracts audio and video streams from the source media. A speech recognition engine processes the audio and/or video stream to transcribe it into a time-aligned textual transcription and to extract audio amplitude by time interval, thereby providing a time-aligned machine transcribed media. The server processor measures the aural amplitude of the extracted audio and assigns a numerical value that is normalized to a single, universal amplitude scale. A database stores the time-aligned machine transcribed media, the time-aligned video frames and the assigned value from the normalized amplitude scale.

RELATED APPLICATION

This application is a continuation of application Ser. No. 14/719,125 filed May 21, 2015, which is a continuation-in-part application of application Ser. No. 14/328,620 filed Jul. 10, 2014, which claims the benefit of Provisional Application No. 61/844,597 filed Jul. 10, 2013, each of which is incorporated herein by reference in its entirety.

BACKGROUND OF THE INVENTION

The invention relates to audio/video/imagery processing, more particularly to audio/video/imagery metadata extraction and analytics.

Extraction and analysis of non-transcribed media has typically been a labor-intensive, human-driven process that does not allow for extensive and consistent metadata extraction in rapid fashion. One or more persons must view and listen to the source media, e.g., audio and/or video content, and manually transcribe the corresponding audio to generate an index of what took place and when, or to generate closed captioning text that is synchronized to the video. Manually locating and recording a timestamp for even a small fraction of the speech and script elements often requires several hours of manual work, and doing this for the entire source media may require several days or more.

Currently available systems and methods deal with the extraction and analysis of transcribed media. They time-match a written script text to a raw speech transcript produced from an analysis of recorded dialog to ensure accuracy of the transcript. That is, transcribed source media is processed and the resulting speech-recognized transcript is compared to the written script to ensure accuracy. Such transcripts are used in the movie industry and in video production environments to search or index video/audio content based on the text provided in the written script. An aligned transcript can also be used to generate closed caption text that is synchronized to the actual spoken dialog in the source media.

These automated techniques for time-synchronizing scripts and corresponding video to a pre-existing written script typically utilize a word alignment matrix (e.g., script words vs. transcript words). However, they are traditionally slow and error-prone. These techniques often require a great deal of processing and may contain a large number of errors, rendering the output inaccurate. For example, due to noise or other non-dialogue artifacts, speech-to-text transcripts often assign the wrong time values, off by several minutes or more, to script text. As a result, the transcript may not be reliable, thereby requiring additional time to identify and correct the errors, or causing users to shy away from its use altogether.

The problems are exacerbated when one must extract non-transcribed media, because there is no written script against which to check the speech transcript for accuracy.

Accordingly, it is desirable to provide a technique for providing efficient and accurate time-aligned machine transcribed media that is normalized to a single universal amplitude scale. That is, the claimed invention proceeds upon the desirability of providing a method and system for storing and applying automated machine speech and facial/entity recognition to large volumes of non-transcribed video and/or audio media streams to provide searchable transcribed content that is normalized to a single universal amplitude scale. The searchable transcribed content can be searched and analyzed for metadata to provide a unique perspective onto the data via server-based queries.

OBJECTS AND SUMMARY OF THE INVENTION

An object of the claimed invention is to provide a system and method that transcribes and normalizes non-transcribed media, which can include audio, video and/or imagery, to a single universal amplitude scale.

Another object of the claimed invention is to provide the aforesaid system and method that analyzes the non-transcribed media frame by frame.

A further object of the claimed invention is to provide the aforesaid system and method that extracts metadata relating to sentiment, psychological, socioeconomic and image recognition traits.

In accordance with an exemplary embodiment of the claimed invention, a computer based method is provided for transcribing and extracting metadata from a source media. A processor-based server extracts an audio stream from the source media and normalizes the audio stream to a single universal amplitude scale by generating an audio histogram. The processor-based server determines a loudest frame of the audio stream with the loudest sound and a softest frame of the audio stream with the softest sound. A normalized minimum amplitude value is assigned to the softest frame and a normalized maximum amplitude value is assigned to the loudest frame. Each frame of the audio stream is then compared to the loudest frame and to the softest frame by utilizing the audio histogram, and assigned a normalized amplitude value between the normalized minimum and maximum values in accordance with the comparison result. The normalized amplitude value is stored for each frame of the audio stream in a database.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid audio stream is processed by a speech recognition engine to transcribe the audio stream into a time-aligned textual transcription, thereby providing a time-aligned machine transcribed media. The server processor processes the time-aligned machine transcribed media to extract time-aligned textual metadata associated with the source media. The time-aligned machine transcribed media and the time-aligned textual metadata are stored in the database.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid method performs a textual sentiment analysis on the full or a segment of the time-aligned textual transcription by the server processor to extract time-aligned sentiment metadata. Database lookups are performed based on predefined sentiment weighted texts stored in the database. One or more matched time-aligned sentiment metadata is received from the database by the server processor.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid method performs a natural language processing on the full or a segment of the time-aligned textual transcription by the server processor to extract time-aligned natural language processed metadata related to at least one of the following: an entity, a topic, a key theme, a subject, an individual, and a place. Database lookups are performed based on predefined natural language weighted texts stored in the database. One or more matched time-aligned natural language metadata is received from the database by the server processor.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid method performs a demographic estimation processing on the full or a segment of the time-aligned textual transcription by the server processor to extract time-aligned demographic metadata. Database lookups are performed based on predefined word/phrase demographic associations stored in the database. One or more matched time-aligned demographic metadata is received from the database by the server processor.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid method performs a psychological profile estimation processing on the full or a segment of the time-aligned textual transcription by the server processor to extract time-aligned psychological metadata. Database lookups are performed based on predefined word/phrase psychological profile associations stored in the database. One or more matched time-aligned psychological metadata is received from the database by the server processor.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid method performs at least one of the following: a textual sentiment analysis on the time-aligned machine transcribed media by the server processor to extract time-aligned sentiment metadata; a natural language processing on the time-aligned machine transcribed media by the server processor to extract time-aligned natural language processed metadata related to at least one of the following: an entity, a topic, a key theme, a subject, an individual, and a place; a demographic estimation processing on the time-aligned machine transcribed media by the server processor to extract time-aligned demographic metadata; and a psychological profile estimation processing on the time-aligned machine transcribed media by the server processor to extract time-aligned psychological metadata.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid method extracts a video stream from the source media by a video frame engine of the processor-based server. The time-aligned video frames are extracted from the video stream by the video frame engine. The time-aligned video frames are stored in the database. The time-aligned video frames are processed by a server processor to extract time-aligned visual metadata associated with the source media.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid method generates digital advertising based on one or more time-aligned textual metadata associated with the source media.

In accordance with an exemplary embodiment of the claimed invention, a computer based method is provided for converting and extracting metadata from a source media comprising both audio and video streams. A processor-based server extracts the audio stream from the source media and normalizes the audio stream to a single universal amplitude scale by generating an audio histogram. The processor-based server determines a loudest frame of the audio stream with the loudest sound and a softest frame of the audio stream with the softest sound. A normalized minimum amplitude value is assigned to the softest frame and a normalized maximum amplitude value is assigned to the loudest frame. Each frame of the audio stream is then compared to the loudest frame and to the softest frame by utilizing the audio histogram, and assigned a normalized amplitude value between the normalized minimum and maximum values in accordance with the comparison result. A video frame engine of the processor-based server extracts the video stream from the source media and processes the time-aligned video frames to extract time-aligned visual metadata associated with the source media. The normalized amplitude value, time-aligned video frames, and time-aligned visual metadata are stored in a database.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid method performs an optical character recognition (OCR) analysis on the time-aligned video frames by the server processor to extract time-aligned OCR metadata. Texts are extracted from graphics by a timed interval from the time-aligned video frames. Database lookups are performed based on a dataset of predefined recognized fonts, letters and languages stored in the database. One or more matched time-aligned OCR metadata is received from the database by the server processor.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid method performs a facial recognition analysis on the time-aligned video frames by the server processor to extract time-aligned facial recognition metadata. Facial data points are extracted by a timed interval from the time-aligned video frames. Database lookups are performed based on a dataset of predefined facial data points for individuals stored in the database. One or more matched time-aligned facial metadata is received from the database by the server processor.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid method performs an object recognition analysis on the time-aligned video frames by the server processor to extract time-aligned object recognition metadata. Object data points are extracted by a timed interval from the time-aligned video frames. Database lookups are performed based on a dataset of predefined object data points for a plurality of objects stored in the database. One or more matched time-aligned object metadata is received from the database by the server processor.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid method performs at least one of the following: an optical character recognition (OCR) analysis on the time-aligned video frames by the server processor to extract time-aligned OCR metadata; a facial recognition analysis on the time-aligned video frames by the server processor to extract time-aligned facial recognition metadata; and an object recognition analysis on the time-aligned video frames by the server processor to extract time-aligned object recognition metadata.

In accordance with an exemplary embodiment of the claimed invention, a non-transitory computer readable medium comprising computer executable code for transcribing and extracting metadata from a source media is provided. A processor-based server is instructed to extract an audio stream from the source media and to normalize the audio stream to a single universal amplitude scale by generating an audio histogram. The processor-based server is instructed to determine a loudest frame of the audio stream with the loudest sound and a softest frame of the audio stream with the softest sound. The processor-based server is instructed to assign a normalized minimum amplitude value to the softest frame and a normalized maximum amplitude value to the loudest frame. Each frame of the audio stream is then compared to the loudest frame and to the softest frame by utilizing the audio histogram, and assigned a normalized amplitude value between the normalized minimum and maximum values in accordance with the comparison result. A database is instructed to store the normalized amplitude value for each frame of the audio stream.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid computer executable code further comprises instructions for a speech recognition engine to process the audio stream to transcribe the audio stream into a time-aligned textual transcription to provide a time-aligned machine transcribed media. The server processor is instructed to process the time-aligned machine transcribed media to extract time-aligned textual metadata associated with the source media. The database is instructed to store the time-aligned machine transcribed media and the time-aligned textual metadata.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid computer executable code further comprises instructions for performing a textual sentiment analysis on the full or a segment of the time-aligned textual transcription by the server processor to extract time-aligned sentiment metadata. Database lookups are performed based on predefined sentiment weighted texts stored in the database. One or more matched time-aligned sentiment metadata is received from the database by the server processor.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid computer executable code further comprises instructions for performing a natural language processing on the full or a segment of the time-aligned textual transcription by the server processor to extract time-aligned natural language processed metadata related to at least one of the following: an entity, a topic, a key theme, a subject, an individual, and a place. Database lookups are performed based on predefined natural language weighted texts stored in the database. One or more matched time-aligned natural language metadata is received from the database by the server processor.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid computer executable code further comprises instructions for performing a demographic estimation processing on the full or a segment of the time-aligned textual transcription by the server processor to extract time-aligned demographic metadata. Database lookups are performed based on predefined word/phrase demographic associations stored in the database. One or more matched time-aligned demographic metadata is received from the database by the server processor.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid computer executable code further comprises instructions for performing a psychological profile estimation processing on the full or a segment of the time-aligned textual transcription by the server processor to extract time-aligned psychological metadata. Database lookups are performed based on predefined word/phrase psychological profile associations stored in the database. One or more matched time-aligned psychological metadata is received from the database by the server processor.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid computer executable code further comprises instructions for generating digital advertising based on one or more time-aligned textual metadata associated with the source media.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid computer executable code further comprises instructions for extracting a video stream from the source media by a video frame engine of a processor-based server. Time-aligned video frames are extracted from the video stream by the video frame engine. The time-aligned video frames are stored in the database. The time-aligned video frames are processed by a server processor to extract time-aligned visual metadata associated with the source media.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid computer executable code further comprises instructions for performing an optical character recognition (OCR) analysis on the time-aligned video frames by the server processor to extract time-aligned OCR metadata. Texts are extracted from graphics by a timed interval from the time-aligned video frames. Database lookups are performed based on a dataset of predefined recognized fonts, letters and languages stored in the database. One or more matched time-aligned OCR metadata is received from the database by the server processor.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid computer executable code further comprises instructions for performing a facial recognition analysis on the time-aligned video frames by the server processor to extract time-aligned facial recognition metadata. Facial data points are extracted by a timed interval from the time-aligned video frames. Database lookups are performed based on a dataset of predefined facial data points for individuals stored in the database. One or more matched time-aligned facial metadata is received from the database by the server processor.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid computer executable code further comprises instructions for performing an object recognition analysis on the time-aligned video frames by the server processor to extract time-aligned object recognition metadata. Object data points are extracted by a timed interval from the time-aligned video frames. Database lookups are performed based on a dataset of predefined object data points for a plurality of objects stored in the database. One or more matched time-aligned object metadata is received from the database by the server processor.

In accordance with an exemplary embodiment of the claimed invention, a system for transcribing and extracting metadata from a source media is provided. A processor-based server is connected to a communications network for receiving the source media. The processor-based server extracts an audio stream from the source media and normalizes the audio stream to a single universal amplitude scale by generating an audio histogram. The processor-based server determines a loudest frame of the audio stream with the loudest sound and a softest frame of the audio stream with the softest sound. A normalized minimum amplitude value is assigned to the softest frame and a normalized maximum amplitude value is assigned to the loudest frame. Each frame of the audio stream is then compared to the loudest frame and to the softest frame by utilizing the audio histogram, and assigned a normalized amplitude value between the normalized minimum and maximum values in accordance with the comparison result. The normalized amplitude value is stored for each frame of the audio stream in a database.

In accordance with an exemplary embodiment of the claimed invention, a speech recognition engine of the aforesaid server processes the audio stream to transcribe the audio stream into a time-aligned textual transcription, thereby providing a time-aligned machine transcribed media. A server processor processes the time-aligned machine transcribed media to extract time-aligned textual metadata associated with the source media. A database stores the time-aligned machine transcribed media and the time-aligned textual metadata associated with the source media.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid server processor performs a textual sentiment analysis on the full or a segment of the time-aligned textual transcription to extract time-aligned sentiment metadata. The server processor performs database lookups based on predefined sentiment weighted texts stored in the database, and receives one or more matched time-aligned sentiment metadata from the database.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid server processor performs a natural language processing on the full or a segment of the time-aligned textual transcription to extract time-aligned natural language processed metadata related to at least one of the following: an entity, a topic, a key theme, a subject, an individual, and a place. The server processor performs database lookups based on predefined natural language weighted texts stored in the database, and receives one or more matched time-aligned natural language metadata from the database.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid server processor performs a demographic estimation processing on the full or a segment of the time-aligned textual transcription to extract time-aligned demographic metadata. The server processor performs database lookups based on predefined word/phrase demographic associations stored in the database, and receives one or more matched time-aligned demographic metadata from the database.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid server processor performs a psychological profile estimation processing on the full or a segment of the time-aligned textual transcription to extract time-aligned psychological metadata. The server processor performs database lookups based on predefined word/phrase psychological profile associations stored in the database, and receives one or more matched time-aligned psychological metadata from the database.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid server comprises a video frame engine for extracting a video stream from the source media. The server processor extracts time-aligned video frames from the video stream and processes the time-aligned video frames to extract time-aligned visual metadata associated with the source media. The database stores the time-aligned video frames.

In accordance with an exemplary embodiment of the claimed invention, the aforesaid server processor performs one or more of the following analyses on the time-aligned video frames: an optical character recognition (OCR) analysis to extract time-aligned OCR metadata; a facial recognition analysis to extract time-aligned facial recognition metadata; and an object recognition analysis to extract time-aligned object recognition metadata. The server processor performs the OCR analysis by extracting texts from graphics by a timed interval from the time-aligned video frames; performing database lookups based on a dataset of predefined recognized fonts, letters and languages stored in the database; and receiving one or more matched time-aligned OCR metadata from the database. The server processor performs the facial recognition analysis by extracting facial data points by a timed interval from the time-aligned video frames; performing database lookups based on a dataset of predefined facial data points for individuals stored in the database; and receiving one or more matched time-aligned facial metadata from the database. The server processor performs the object recognition analysis by extracting object data points by a timed interval from the time-aligned video frames; performing database lookups based on a dataset of predefined object data points for a plurality of objects stored in the database; and receiving one or more matched time-aligned object metadata from the database.

Various other objects, advantages and features of the present invention will become readily apparent from the ensuing detailed description, and the novel features will be particularly pointed out in the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

The following detailed description, given by way of example and not intended to limit the present invention solely thereto, will best be understood in conjunction with the accompanying drawings, in which:

FIG. 1 is a block diagram of the system architecture in accordance with an exemplary embodiment of the claimed invention;

FIG. 2A is a block diagram of a client device in accordance with an exemplary embodiment of the claimed invention;

FIG. 2B is a block diagram of a server in accordance with an exemplary embodiment of the claimed invention;

FIG. 3 is a flowchart of an exemplary process for transcribing and analyzing a non-transcribed video/audio stream in accordance with an exemplary embodiment of the claimed invention;

FIG. 4 is a flowchart of an exemplary process for real-time or post processed server analysis and metadata extraction of machine transcribed media in accordance with an exemplary embodiment of the claimed invention;

FIG. 5 is a flow chart of an exemplary process for real-time or post processed audio amplitude analysis of machine transcribed media in accordance with an exemplary embodiment of the claimed invention;

FIG. 6 is a flow chart of an exemplary process for real-time or post processed sentiment server analysis of machine transcribed media in accordance with an exemplary embodiment of the claimed invention;

FIG. 7 is a flow chart of an exemplary process for real-time or post processed natural language processing analysis of machine transcribed media in accordance with an exemplary embodiment of the claimed invention;

FIG. 8 is a flow chart of an exemplary process for real-time or post processed demographic estimation analysis of machine transcribed media in accordance with an exemplary embodiment of the claimed invention;

FIG. 9 is a flow chart of an exemplary process for real-time or post processed psychological profile estimation server analysis of machine transcribed media in accordance with an exemplary embodiment of the claimed invention;

FIG. 10 is a flow chart of an exemplary process for real-time or post processed optical character recognition server analysis of machine transcribed media in accordance with an exemplary embodiment of the claimed invention;

FIG. 11 is a flow chart of an exemplary process for real-time or post processed facial recognition analysis of machine transcribed media in accordance with an exemplary embodiment of the claimed invention; and

FIG. 12 is a flow chart of an exemplary process for real-time or post processed object recognition analysis of machine transcribed media in accordance with an exemplary embodiment of the claimed invention.

DETAILED DESCRIPTION OF THE EMBODIMENTS

As shown in FIG. 1, at the system level, the claimed invention comprises one or more web-enabled processor based client devices 200, one or more processor based servers 100, and a communications network 300 (e.g., the Internet). In accordance with an exemplary embodiment of the claimed invention, as shown in FIG. 2A, each client device 200 comprises a processor or client processor 210, a display or screen 220, an input device 230 (which can be the same as the display 220 in the case of touch screens), a memory 240, a storage device 250 (preferably a persistent storage, e.g., a hard drive), and a network connection facility 260 to connect to the communications network 300.

In accordance with an exemplary embodiment of the claimed invention, the server 100 comprises a processor or server processor 110, a memory 120, a storage device 130 (preferably a persistent storage, e.g., hard disk, database, etc.), a network connection facility 140 to connect to the communications network 300, a speech recognition engine 150 and a video frame engine 160.

The network enabled client device 200 includes but is not limited to a computer system, a personal computer, a laptop, a notebook, a netbook, a tablet or tablet-like device, an IPad® (IPAD is a registered trademark of Apple Inc.) or IPad-like device, a cell phone, a smart phone, a personal digital assistant (PDA), a mobile device, a television, or any such device having a screen connected to the communications network 300, and the like.

The communications network 300 can be any type of electronic transmission medium, for example, including but not limited to the following networks: a telecommunications network, a wireless network, a virtual private network, a public internet, a private internet, a secure internet, a private network, a public network, a value-added network, an intranet, a wireless gateway, or the like. In addition, the connectivity to the communications network 300 may be, for example, by cellular transmission, Ethernet, Token Ring, Fiber Distributed Datalink Interface, Asynchronous Transfer Mode, Wireless Application Protocol, or any other form of network connectivity.

Moreover, in accordance with an embodiment of the claimed invention, the computer-based methods for implementing the claimed invention are implemented using processor-executable instructions for directing operation of a device or devices under processor control. The processor-executable instructions can be stored on a tangible computer-readable medium, such as but not limited to a disk, CD, DVD, flash memory, portable storage or the like. The processor-executable instructions can be accessed from a service provider's website or stored as a set of downloadable processor-executable instructions, for example, for downloading and installation from an Internet location, e.g., the server 100 or another web server (not shown).

Turning now to FIG. 3, there is illustrated a flow chart describing the process of converting, extracting metadata from and analyzing untranscribed data in real-time or post-processing in accordance with an exemplary embodiment of the claimed invention. Untranscribed digital and/or non-digital source data, such as printed and analog media streams, are received by the server 100 and stored in the database 130 at step 300. These streams can represent digitized/undigitized archived audio, digitized/undigitized archived video, digitized/undigitized archived images or other audio/video formats. The server processor 110 distinguishes or sorts the type of media received into at least printed non-digital content at step 301 and audio/video/image media at step 302. The server processor 110 routes the sorted media to the appropriate module/component for processing and normalization.

A single server or a cluster of servers or transcription servers 100 processes the media input and extracts relevant metadata at step 303. Data (or metadata) is extracted by streaming digital audio or video content into a server processor 110 running codecs which can read the data streams. In accordance with an exemplary embodiment of the claimed invention, the server processor 110 applies various processes to extract the relevant metadata.

Turning now to FIG. 4, there is illustrated a real-time or post-processed server analysis and metadata extraction of machine transcribed media. The server processor 110 extracts the audio stream from the source video/audio file at step 400. The speech recognition engine 150 executes or applies speech-to-text conversion processes, e.g., a speech recognition process, on the audio and/or video streams to transcribe the audio/video stream into textual data, preferably time-aligned textual data or transcription, at step 304. The time-aligned textual transcription and metadata are stored in a database 130 or hard files at step 308. Preferably, each word in the transcription is given a start/stop timestamp to help locate the word via server-based search interfaces.
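
By way of a non-limiting illustration, the following sketch shows one possible way to persist word-level start/stop timestamps (step 308) so that a server-based search interface can locate a word within the source media. The use of SQLite, the table name transcript_words and the column names are assumptions made for this example and are not prescribed by the claimed invention.

```python
import sqlite3

# A minimal, hypothetical schema for word-level timestamps (step 308).
# Each transcribed word keeps its start/stop time so a server-based
# search interface can locate it within the source media.
conn = sqlite3.connect(":memory:")  # a persistent database in a real deployment
conn.execute(
    """CREATE TABLE IF NOT EXISTS transcript_words (
           media_id   TEXT,    -- identifier of the source video/audio file
           word       TEXT,    -- recognized word from the speech engine
           start_sec  REAL,    -- start timestamp within the media
           stop_sec   REAL     -- stop timestamp within the media
       )"""
)

# Example rows as they might be produced by a speech recognition engine.
rows = [("clip-001", "hello", 12.40, 12.71),
        ("clip-001", "world", 12.75, 13.02)]
conn.executemany("INSERT INTO transcript_words VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Locating a word via a standard database query.
for row in conn.execute(
        "SELECT media_id, start_sec, stop_sec FROM transcript_words WHERE word = ?",
        ("world",)):
    print(row)
```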

In accordance with an exemplary embodiment of the claimed invention, the server processor 110 performs real-time or post processed audio amplitude analysis of machine transcribed media comprising an audio stream and/or a video stream, preferably time-aligned audio frames and/or time-aligned video frames. The server processor 110 extracts audio frame metadata from the extracted audio stream at step 306 and executes an amplitude extraction processing on the extracted audio frame metadata at step 410. The audio frame engine 170 extracts the audio stream from the source video/audio file at step 400. The audio frame engine 170 executes or applies audio frame extraction on the audio streams to transcribe the audio stream into time-aligned audio frames at step 306. The time-aligned audio frames are stored in a database 130 or hard files at step 355. The audio metadata extraction processing is further described in conjunction with FIG. 5 illustrating a real-time or post processed audio amplitude analysis of machine transcribed media in accordance with an exemplary embodiment of the claimed invention. The server processor 110 stores the extracted audio frame metadata, preferably time-aligned audio metadata associated with the source media, in the database 130 at step 355. The server processor 110 extracts audio amplitude by a timed interval from the stored time-aligned audio frames at step 412 and measures an aural amplitude of the extracted audio amplitude at step 413. The server processor 110 then assigns a numerical value to the extracted amplitude at step 414. If the server processor 110 successfully extracts and processes the audio amplitude, then the server processor 110 stores the time-aligned aural amplitude metadata in the database 130 at step 415 and proceeds to the next timed interval of the time-aligned audio frames for processing. If the server processor 110 is unable to successfully extract and process the audio amplitude for a given extracted time-aligned audio frame, then the server processor 110 rejects the current timed interval of time-aligned audio frames and proceeds to the next timed interval of the time-aligned audio frames for processing.
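
A minimal sketch of the per-interval amplitude measurement (steps 412-415) follows, assuming 16-bit PCM samples and an RMS value expressed in dB relative to full scale as the measured "aural amplitude"; the function names and the choice of RMS are illustrative assumptions rather than the claimed implementation.

```python
import math

def measure_interval_db(samples):
    """Measure an aural amplitude (RMS, expressed in dB relative to 16-bit
    full scale) for one timed interval of PCM samples; returns None if the
    interval is empty and cannot be processed."""
    if not samples:
        return None  # rejected interval (no usable audio)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")  # digital silence
    return 20.0 * math.log10(rms / 32768.0)

def extract_amplitudes(frames):
    """Assign a numerical dB value to each timed interval (steps 412-414).
    Intervals that cannot be processed are skipped (rejected)."""
    amplitudes = {}
    for index, samples in enumerate(frames):
        value = measure_interval_db(samples)
        if value is not None:
            amplitudes[index] = value  # time-aligned aural amplitude metadata
    return amplitudes

# Example: three one-interval "frames" of synthetic samples; the empty
# interval is rejected and does not appear in the result.
print(extract_amplitudes([[1000, -1000, 1200], [], [30000, -29000, 31000]]))
```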

After processing the last timed interval of the stored time-aligned audio frames, the server processor 110 generates an audio histogram of the audio file/data of the untranscribed digital and/or non-digital source data at step 417. In accordance with an exemplary embodiment of the claimed invention, the server processor 110 normalizes the entire audio file by determining the loudest and softest sounds within the audio file by frame, by predetermined time (e.g., second) or by another temporal approach at step 418. The server processor 110 assigns a normalized minimum amplitude value (e.g., a value of zero on an amplitude scale of 0-100 or an amplitude scale of 0-10) to the softest frame (the frame with the softest sound) and a normalized maximum amplitude value (e.g., a value of 100 on an amplitude scale of 0-100 or a value of 10 on an amplitude scale of 0-10) to the loudest frame (the frame with the loudest sound) at step 418. The server processor compares each frame of the audio file to the loudest and softest frames in the audio file by utilizing the audio histogram and assigns a relative value therebetween. That is, the logarithmic dB values of each frame are normalized/transformed into values within a single universal amplitude scale, thereby enabling the user to perform a universal search based on the sound level. The relative value of each frame of the audio file is stored in the database 130 at step 419. Preferably, the server processor 110 assigns a normalized amplitude value between the normalized minimum amplitude value and the normalized maximum amplitude value (e.g., an amplitude value of 1-99 on an amplitude scale of 0-100) to each frame in accordance with a result of the comparison at step 600. It is appreciated that the amplitude scale can be a numerical scale or an alphanumerical scale. The normalized amplitude value assigned to each frame of the audio file is stored in the database 130 at step 601. That is, the server processor 110 maps the numerical value of the respective frame to a single, normalized, universal amplitude scale (e.g., an amplitude scale of 0-100) at step 600, thereby enabling the user to search across all files with a standard database query. That is, the server processor 110 assigns the same value to the frame with the loudest sound in an undersampled (overly quiet) audio file and to the frame with the loudest sound in an oversampled (overly loud) audio file. In accordance with an exemplary embodiment of the claimed invention, as additional audio files are processed, the server processor 110 normalizes all of the processed audio files onto a single universal amplitude scale. This makes all of the time-aligned metadata queryable or searchable using a standard database query. That is, all of the time-aligned metadata can be searched/associated against all other metadata.
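
The following sketch illustrates one way the normalization of steps 417-418 and 600-601 could be realized: the softest frame receives the normalized minimum value, the loudest frame receives the normalized maximum value, and every other frame is placed in between in proportion to its dB value. The linear interpolation and the 0-100 scale are assumptions used for this example; the claimed invention only requires that each frame receive a value on a single universal amplitude scale.

```python
def normalize_to_universal_scale(frame_db, scale_max=100.0):
    """Map per-frame dB values onto a single universal amplitude scale.

    The softest frame receives the normalized minimum (0) and the loudest
    frame the normalized maximum (scale_max); every other frame is assigned
    a value in between based on where its dB value falls."""
    softest = min(frame_db.values())
    loudest = max(frame_db.values())
    span = loudest - softest
    normalized = {}
    for frame_index, db_value in frame_db.items():
        if span == 0:
            normalized[frame_index] = scale_max  # uniform file: one level only
        else:
            normalized[frame_index] = round(scale_max * (db_value - softest) / span)
    return normalized

# An undersampled (quiet) file and an oversampled (loud) file both end up
# with their loudest frame at 100, making them comparable in a single query.
quiet_file = {0: -60.0, 1: -42.0, 2: -55.0}
loud_file = {0: -12.0, 1: -1.5, 2: -6.0}
print(normalize_to_universal_scale(quiet_file))
print(normalize_to_universal_scale(loud_file))
```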

Since sound levels from different media sources may differ even for the same event, the media files are transformed or normalized to a common standard or a single universal amplitude scale. That is, each audio frame of the audio file is mapped to a single universal amplitude scale, thereby rendering all metadata queryable or searchable using a standard database query. In accordance with an exemplary embodiment of the claimed invention, the server processor 110 maps each frame of the media file to the normalized amplitude scale based on the Euclidean distance between the logarithmic dB value of the frame and the logarithmic dB values associated with each amplitude value of the normalized amplitude scale. The server processor 110 assigns each frame of the media file the amplitude value on the normalized amplitude scale yielding the lowest Euclidean distance, thereby establishing a normalized amplitude value for each frame of a plurality of media files. This advantageously enables the claimed system to normalize, codify and search the media files for metadata based on a sound level.
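
A short sketch of this Euclidean-distance mapping follows, assuming a hypothetical reference table that associates each value of the normalized amplitude scale with a representative logarithmic dB level; because the per-frame dB value is a scalar in this simplified example, the Euclidean distance reduces to an absolute difference.

```python
def map_frame_to_scale(frame_db, scale_db_levels):
    """Assign the frame the scale value whose reference dB level has the
    lowest Euclidean distance to the frame's dB value. For scalar dB values
    the Euclidean distance reduces to the absolute difference."""
    return min(scale_db_levels, key=lambda level: abs(scale_db_levels[level] - frame_db))

# Hypothetical reference: scale value -> representative dB level, e.g.
# derived from the audio histograms of previously processed files.
reference_levels = {value: -60.0 + 0.6 * value for value in range(0, 101)}

print(map_frame_to_scale(-33.2, reference_levels))  # nearest scale value
print(map_frame_to_scale(-2.0, reference_levels))
```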

Following the computation of the normalized amplitude levels of the frames of the media files, the server processor 110 performs a number of metadata extraction and processing operations, as shown in FIGS. 3, 4 and 6-12, which are more fully described herein. The server processor 110 performs textual sentiment processing 420, natural language processing 430, demographic estimation processing 440, psychological profile processing 450, optical character recognition processing 510, facial recognition processing 520 and object recognition processing 530.

Turning to FIG. 3, in accordance with an exemplary embodiment of the claimed invention, the server processor 110 executes the textual metadata extraction process on the transcribed data or transcript of the extracted audio stream, preferably the time-aligned textual transcription, to analyze and extract metadata relating to textual sentiment, natural language processing, demographics estimation and psychological profile at step 307. The extracted metadata, preferably time-aligned metadata associated with the source video/audio files, are stored in the database or data warehouse 130. For example, the server processor 110 analyzes or compares either the entire transcript or a segmented transcript to a predefined sentiment weighted text for a match. When a match is found, the server processor 110 stores the time-aligned metadata associated with the source media in the database 130. The server processor 110 can execute one or more application program interface (API) servers to search the stored time-aligned metadata in the data warehouse 130 in response to a user search query or data request.

In accordance with an exemplary embodiment of the claimed invention, as shown in FIG. 4, the server processor 110 performs real-time or post processed sentiment analysis of machine transcribed media at step 307. The server processor 110 performs a textual sentiment processing or analysis on the stored time-aligned textual transcription to extract sentiment metadata, preferably time-aligned sentiment metadata, at step 420. The textual sentiment processing is further described in conjunction with FIG. 6 illustrating a real-time or post processed sentiment server analysis of machine transcribed media in accordance with an exemplary embodiment of the claimed invention. The server processor 110 analyzes the entire transcript for sentiment related metadata at step 421; preferably, the entire transcript is selected for analysis based on the user search query or data request. Alternatively, the server processor 110 analyzes a segmented transcript for sentiment related metadata at step 422; preferably, the segmented transcript is selected for analysis based on the user search query or data request. The server processor 110 performs database lookups based on the predefined sentiment weighted text stored in the sentiment database 424 at step 423. It is appreciated that the predefined sentiment weighted text can be alternatively or additionally stored in the data warehouse 130, and the database lookups can be performed against the data warehouse 130 or against a separate sentiment database 424. The sentiment database 424 or data warehouse 130 returns the matched sentiment metadata, preferably time-aligned sentiment metadata, to the server processor 110 if a match is found at step 425. The server processor 110 stores the time-aligned textual sentiment metadata in the data warehouse 130 at step 426.

For example, the server processor 110 processes a particular sentence in the transcribed text, such as “The dog attacked the owner viciously, while appearing happy”. In accordance with an exemplary embodiment of the claimed invention, the server processor 110 extracts each word of the sentence via a programmatic function, and removes “stop words”. Stop words are common words which typically evoke no emotion or meaning, e.g., “and”, “or”, “in”, “this”, etc. The server processor 110 then identifies adjectives, adverbs and verbs in the queried sentence. Using the database 130, 424 containing numerical positive/negative values for each word containing emotion/sentiment, the server processor 110 applies an algorithm to determine the overall sentiment of the processed text. In this exemplary case, the server processor 110 assigns the following numerical values to various words in the queried sentence: the word “attacked” is assigned or weighed a value between 3-4 on a 1-5 negative scale, the word “viciously” is assigned a value between 4-5 on a 1-5 negative scale, and the word “happy” is assigned a value between 2-3 on a 1-5 positive scale. The server processor 110 determines a weighted average score of the queried sentence from the individual values assigned to the various words of the queried sentence.
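
A compact sketch of this weighted scoring follows, using a tiny illustrative lexicon in which negative-scale words carry negative signs and positive-scale words carry positive signs; the actual word weights reside in the sentiment database 424 and/or the data warehouse 130, and the values shown here are assumptions for the example only.

```python
# Illustrative only: a tiny sentiment lexicon. Negative-scale words carry
# negative signs and positive-scale words positive signs; real weights live
# in the sentiment database 424 / data warehouse 130.
STOP_WORDS = {"the", "and", "or", "in", "this", "while"}
SENTIMENT_LEXICON = {"attacked": -3.5, "viciously": -4.5, "happy": 2.5}

def sentence_sentiment(sentence):
    """Strip stop words, look up the remaining words in the lexicon and
    return the average of the matched sentiment weights (None if no match)."""
    words = [w.strip(".,").lower() for w in sentence.split()]
    scores = [SENTIMENT_LEXICON[w] for w in words
              if w not in STOP_WORDS and w in SENTIMENT_LEXICON]
    return sum(scores) / len(scores) if scores else None

print(sentence_sentiment("The dog attacked the owner viciously, while appearing happy"))
# -> (-3.5 - 4.5 + 2.5) / 3 = -1.83..., i.e. an overall negative sentiment
```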

In accordance with an exemplary embodiment of the claimed invention, as shown in FIG. 4, the server processor 110 performs real-time or post processed natural language analysis of machine transcribed media at step 307. The server processor 110 performs a natural language processing or analysis on the stored time-aligned textual transcription to extract natural language processed metadata related to entities, topics, key themes, subjects, individuals, people, places, things and the like at step 430. Preferably, the server processor 110 extracts time-aligned natural language processed metadata. The natural language processing is further described in conjunction with FIG. 7 illustrating a real-time or post processed natural language processing analysis of machine transcribed media in accordance with an exemplary embodiment of the claimed invention. The server processor 110 analyzes the entire transcript for natural language processed metadata at step 431; preferably, the entire transcript is selected for analysis based on the user search query or data request. Alternatively, the server processor 110 analyzes a segmented transcript for the natural language processed metadata at step 432; preferably, the segmented transcript is selected for analysis based on the user search query or data request. The server processor 110 performs database lookups based on the predefined natural language weighted text stored in the natural language database 434 at step 433. It is appreciated that the predefined natural language weighted text can be alternatively or additionally stored in the data warehouse 130, and the database lookups can be performed against the data warehouse 130 or against a separate natural language database 434. The natural language database 434 or data warehouse 130 returns the matched natural language processed metadata, preferably time-aligned natural language processed metadata, to the server processor 110 if a match is found at step 435. The server processor 110 stores the time-aligned natural language processed metadata in the data warehouse 130 at step 436.

In accordance with an exemplary embodiment of the claimed invention, the server processor 110 queries the transcribed text, preferably by each extracted sentence, against the data warehouse 130 and/or natural language database 434 via an API or other suitable interface to determine the entity and/or topic information. That is, the server processor 110 analyzes each sentence or each paragraph of the transcribed text and extracts known entities and topics based on the language analysis. In accordance with an exemplary embodiment of the claimed invention, the server processor 110 compares the words and phrases in the transcribed text against the database 130, 434 containing words categorized by entity and topic. An example of an entity can be an individual, person, place or thing (noun). An example of a topic can be politics, religion or other more specific genres of discussion.
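
The following sketch illustrates such a lookup with small in-memory stand-ins for the categorized entity and topic lists that the natural language database 434 and/or the data warehouse 130 would hold; the specific entries and the simple keyword-overlap scoring are assumptions made for this example.

```python
# Illustrative stand-ins for the categorized words and phrases that the
# natural language database 434 / data warehouse 130 would hold.
KNOWN_ENTITIES = {"babe ruth": "person", "yankee stadium": "place"}
TOPIC_KEYWORDS = {"politics": {"election", "senate"},
                  "sports": {"home", "runs", "stadium", "inning"}}

def extract_entities_and_topics(sentence):
    """Compare the sentence's words and phrases against categorized lists to
    extract known entities and the best-matching topic."""
    text = sentence.lower()
    entities = {phrase: kind for phrase, kind in KNOWN_ENTITIES.items() if phrase in text}
    words = set(text.replace(",", "").split())
    topic_hits = {topic: len(words & keywords) for topic, keywords in TOPIC_KEYWORDS.items()}
    best_topic = max(topic_hits, key=topic_hits.get) if any(topic_hits.values()) else None
    return entities, best_topic

print(extract_entities_and_topics("In 1932, Babe Ruth hits 3 home runs in Yankee Stadium"))
```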

In accordance with an exemplary embodiment of the claimed invention, as shown in FIG. 4, the server processor 110 performs real-time or post processed demographic estimation server analysis of machine transcribed media at step 307. The server processor 110 performs a demographic estimation processing or analysis on the stored time-aligned textual transcription to extract demographic metadata, preferably time-aligned demographic metadata, at step 440. The demographic estimation processing is further described in conjunction with FIG. 8 illustrating a real-time or post processed demographic estimation server analysis of machine transcribed media in accordance with an exemplary embodiment of the claimed invention. The server processor 110 analyzes the entire transcript for demographic metadata at step 441; preferably, the entire transcript is selected for analysis based on the user search query or data request. Alternatively, the server processor 110 analyzes a segmented transcript for the demographic metadata at step 442; preferably, the segmented transcript is selected for analysis based on the user search query or data request. The server processor 110 performs database lookups based on the predefined word/phrase demographic associations stored in the demographic database 444 at step 443. It is appreciated that the predefined word/phrase demographic associations can be alternatively or additionally stored in the data warehouse 130, and the database lookups can be performed against the data warehouse 130 or against a separate demographic database 444. The demographic database 444 or data warehouse 130 returns the matched demographic metadata, preferably time-aligned demographic metadata, to the server processor 110 if a match is found at step 445. The server processor 110 stores the time-aligned demographic metadata in the data warehouse 130 at step 446.

In accordance with an exemplary embodiment of the claimed invention, the server processor 110 queries the source of the transcribed data (e.g., a specific television show) against the data warehouse 130 and/or demographic database 444 via an API or other suitable interface to determine the demographic and/or socio-demographic information. The database 130, 444 contains ratings information for the source audio/video media from which the server processor 110 extracted the transcription. Examples of such sources are broadcast television, internet video and/or audio, broadcast radio and the like.

In accordance with an exemplary embodiment of the claimed invention, the server 100 employs a web scraping service to extract open source, freely available information from a wide taxonomy of web-based texts. These texts, when available via open-source means, are stored within the database 130, 444 and classified by their category (e.g., finance, sports/leisure, travel, and the like). For example, the server processor 110 can classify these texts into twenty categories. Using open source tools and public information, the server processor 110 extracts common demographics for these categories. When a blob of text is inputted into the system (or received by the server 100), the server processor 110 weighs the totality of the words to determine which taxonomy of text most accurately reflects the text being analyzed within the system. For example, “In 1932, Babe Ruth hits 3 home runs in Yankee Stadium” will likely have a 99% chance of being in the sports/baseball taxonomy or being categorized into the sports/leisure category by the server processor 110. Thereafter, the server processor 110 determines the age range percentages and gender percentages based upon stored demographic data in the demographic database 444 and/or the data warehouse 130.
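
A simplified sketch of this taxonomy-based demographic estimation follows; the category keyword sets, the keyword-overlap scoring and the demographic percentages are invented for illustration and stand in for the web-scraped texts and ratings data stored in the demographic database 444 and/or the data warehouse 130.

```python
# Purely illustrative category keyword sets and demographic associations;
# in the claimed system these come from web-scraped texts stored in the
# demographic database 444 / data warehouse 130.
CATEGORY_KEYWORDS = {
    "sports/leisure": {"home", "runs", "stadium", "score", "inning"},
    "finance": {"stocks", "market", "earnings", "dividend"},
}
CATEGORY_DEMOGRAPHICS = {
    "sports/leisure": {"age_18_34": 0.38, "age_35_54": 0.41, "male": 0.62},
    "finance": {"age_18_34": 0.22, "age_35_54": 0.47, "male": 0.58},
}

def estimate_demographics(text):
    """Weigh the words of the text against each taxonomy and return the
    demographic percentages stored for the best-matching category."""
    words = set(text.lower().replace(",", "").split())
    scores = {cat: len(words & kw) for cat, kw in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best, CATEGORY_DEMOGRAPHICS[best]

print(estimate_demographics("In 1932, Babe Ruth hits 3 home runs in Yankee Stadium"))
```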

In accordance with an exemplary embodiment of the claimed invention, as shown in FIG. 4, the server processor 110 performs real-time or post processed psychological profile estimation server analysis of machine transcribed media at step 307. The server processor 110 performs a psychological profile processing or analysis on the stored time-aligned textual transcription to extract psychological metadata, preferably time-aligned psychological metadata, at step 450. The psychological profile processing is further described in conjunction with FIG. 9 illustrating a real-time or post processed psychological profile estimation server analysis of machine transcribed media in accordance with an exemplary embodiment of the claimed invention. The server processor 110 analyzes the entire transcript for psychological metadata at step 451; preferably, the entire transcript is selected for analysis based on the user search query or data request. Alternatively, the server processor 110 analyzes a segmented transcript for the psychological metadata at step 452; preferably, the segmented transcript is selected for analysis based on the user search query or data request. The server processor 110 performs database lookups based on the predefined word/phrase psychological profile associations stored in the psychological database 454 at step 453. It is appreciated that the predefined word/phrase psychological profile associations can be alternatively or additionally stored in the data warehouse 130, and the database lookups can be performed against the data warehouse 130 or against a separate psychological database 454. The psychological database 454 or data warehouse 130 returns the matched psychological metadata, preferably time-aligned psychological metadata, to the server processor 110 if a match is found at step 455. The server processor 110 stores the time-aligned psychological metadata in the data warehouse 130 at step 456.

In accordance with an exemplary embodiment of the claimed invention, the server processor 110 processes each sentence of the transcribed text. The server processor extracts each word from a given sentence and removes the stop words, as previously described herein with respect to the sentiment metadata. The server processor 110 applies an algorithm to each extracted word and associates each extracted word back to the database 130, 454 containing values of “thinking” or “feeling” for that specific word. That is, in accordance with an exemplary embodiment of the claimed invention, the server processor 110 categorizes each extracted word into one of three categories: 1) thinking; 2) feeling; and 3) not relevant, e.g., stop words. It is appreciated that the claimed invention is not limited to sorting the words into these three categories; more than three categories can be utilized. Use of these two specific word categories (thinking and feeling) is a non-limiting example to provide a simplified explanation of the claimed psychological profile estimation processing. A word associated with logic, principles and rules falls within the “thinking” category, and the server processor 110 extracts and sums an appropriate weighted 1-5 numerical value for that “thinking” word. The same method is performed for words in the “feeling” category. Words associated or related to values, beliefs and feelings fall within the “feeling” category, and are similarly assigned an appropriate weighted 1-5 numerical value. The server processor 110 sums these weighted values in each respective category and determines a weighted average value for each sentence, a segmented transcript or the entire transcript. It is appreciated that the server processor 110 uses a similar approach for a variety of psychological profile types, such as extroverted/introverted, sensing/intuitive, perceiving/judging and others.
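
The following sketch illustrates the thinking/feeling scoring described above; the 1-5 word weights and the word lists are assumptions standing in for the associations stored in the psychological database 454 and/or the data warehouse 130.

```python
# Illustrative 1-5 weights; the real word/phrase associations reside in the
# psychological database 454 / data warehouse 130.
THINKING_WORDS = {"logic": 5, "rule": 4, "analyze": 4, "principle": 4}
FEELING_WORDS = {"love": 5, "believe": 3, "happy": 4, "value": 3}
STOP_WORDS = {"the", "and", "or", "in", "this", "i"}

def thinking_feeling_profile(sentence):
    """Categorize each non-stop word as thinking, feeling or not relevant,
    then return the weighted average score of each category for the sentence."""
    words = [w.strip(".,!?").lower() for w in sentence.split() if w.lower() not in STOP_WORDS]
    thinking = [THINKING_WORDS[w] for w in words if w in THINKING_WORDS]
    feeling = [FEELING_WORDS[w] for w in words if w in FEELING_WORDS]
    average = lambda vals: sum(vals) / len(vals) if vals else 0.0
    return {"thinking": average(thinking), "feeling": average(feeling)}

print(thinking_feeling_profile("I love this rule and believe in its logic"))
```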

Turning to FIG. 3, in accordance with an exemplary embodiment of the claimed invention, the server processor 110 executes the visual metadata extraction process on the transcribed data or transcript of the extracted video stream, preferably the time-aligned video frames, to analyze and extract metadata relating to optical character recognition, facial recognition and object recognition at step 305. The extracted metadata, preferably time-aligned metadata associated with the source video files, are stored in the database or data warehouse 130. The video frame engine 160 extracts the video stream from the source video/audio file at step 500. The video frame engine 160 executes or applies video frame extraction on the video streams to transcribe the video stream into time-aligned video frames at step 305. The time-aligned video frames are stored in a database 130 or hard files at step 308.

Turning to FIG. 4, the server processor 110 extracts video frame metadata from the extracted video stream and executes the visual metadata extraction process on the extracted time-aligned video frames at step 305. The server processor 110 can execute one or more application program interface (API) servers to search the stored time-aligned metadata in the data warehouse 130 in response to a user search query or data request.

In accordance with an exemplary embodiment of the claimed invention, as shown in FIG. 4, the server processor 110 performs real-time or post processed optical character recognition server analysis of machine transcribed media at step 305. The server processor 110 performs an optical character recognition (OCR) processing or analysis on the stored time-aligned video frames to extract OCR metadata, preferably time-aligned OCR metadata, at step 510. The OCR metadata extraction processing is further described in conjunction with FIG. 10 illustrating a real-time or post processed optical character recognition server analysis of machine transcribed media in accordance with an exemplary embodiment of the claimed invention. The video frame engine 160 stores the extracted video frame metadata, preferably time-aligned video frames associated with the source media, in the database 130 at step 356. The server processor 110 extracts text from graphics by a timed interval from the stored time-aligned video frames at step 511. The server processor 110 performs database lookups based on a dataset of predefined recognized fonts, letters, languages and the like stored in the OCR database 513 at step 512. It is appreciated that the dataset of predefined recognized fonts, letters, languages and the like can be alternatively or additionally stored in the data warehouse 130, and the database lookups can be performed against the data warehouse 130 or against a separate OCR database 513. The OCR database 513 or data warehouse 130 returns the matched OCR metadata, preferably time-aligned OCR metadata, to the server processor 110 if a match at the timed interval is found at step 514. The server processor 110 stores the time-aligned OCR metadata in the data warehouse 130 at step 515 and proceeds to the next timed interval of the time-aligned video frames for processing. If the server processor 110 is unable to find a match for a given timed interval of the time-aligned video frames, then the server processor 110 skips the current timed interval of time-aligned video frames and proceeds to the next timed interval of the time-aligned video frames for processing.
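
A minimal sketch of this per-interval OCR loop follows; the ocr_text_at helper is hypothetical and stands in for an OCR engine backed by the dataset of predefined fonts, letters and languages, while the loop itself shows the store-on-match, skip-on-no-match behavior of steps 511-515.

```python
def ocr_pass(frames, timed_interval, ocr_text_at, store):
    """Walk the time-aligned video frames at a fixed timed interval,
    attempt OCR on each sampled frame and store any matched text with its
    timestamp; intervals with no match are skipped (steps 511-515)."""
    results = []
    for timestamp in range(0, len(frames), timed_interval):
        text = ocr_text_at(frames[timestamp])  # hypothetical OCR lookup
        if text:  # match found in the OCR dataset
            record = {"timestamp": timestamp, "ocr_text": text}
            store(record)  # persist time-aligned OCR metadata
            results.append(record)
        # no match: skip this interval and proceed to the next one
    return results

# Toy usage: frames are stand-in strings, and the "OCR engine" recognizes
# only frames that carry overlaid text.
fake_frames = ["", "BREAKING NEWS", "", "", "ELECTION 2024", ""]
print(ocr_pass(fake_frames, 1, lambda f: f or None, lambda rec: None))
```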

In accordance with an exemplary embodiment of the claimed invention, as shown in FIG. 4, the server processor 110 performs real-time or post-processed facial recognition analysis of machine transcribed media at step 305. The server processor 110 performs facial recognition processing or analysis on the stored time-aligned video frames to extract facial recognition metadata, preferably time-aligned facial recognition metadata, at step 520. The facial recognition metadata comprises, but is not limited to, emotional state, gender and the like. The facial recognition metadata extraction processing is further described in conjunction with FIG. 11, illustrating a real-time or post-processed facial recognition analysis of machine transcribed media in accordance with an exemplary embodiment of the claimed invention. The video frame engine 160 stores the extracted video frame metadata, preferably the time-aligned video frames associated with the source media, in the database 130 at step 356. The server processor 110 extracts facial data points by timed interval from the stored time-aligned video frames at step 521. The server processor 110 performs database lookups based on a dataset of predefined facial data points for individuals, preferably for various well-known individuals, e.g., celebrities, politicians, newsmakers, etc., stored in the facial database 523 at step 522. It is appreciated that the dataset of predefined facial data points can alternatively or additionally be stored in the data warehouse 130, and the database lookups can be performed against the data warehouse 130 or against a separate facial database 523. The facial database 523 or data warehouse 130 returns the matched facial recognition metadata, preferably time-aligned facial recognition metadata, to the server processor 110 if a match at the timed interval is found at step 524. The server processor 110 stores the time-aligned facial recognition metadata in the data warehouse 130 at step 525 and proceeds to the next timed interval of the time-aligned video frames for processing. If the server processor 110 is unable to find a match for a given timed interval of the time-aligned video frames, then the server processor 110 skips the current timed interval and proceeds to the next timed interval of the time-aligned video frames for processing.

In accordance with an exemplary embodiment of the claimed invention, the server processor 110 or a facial recognition server extracts faces from the transcribed video/audio and matches each of the extracted faces to known individuals or entities stored in the facial database 523 and/or the data warehouse 130. The server processor 110 also extracts and associates these matched individuals back to the extracted transcribed text, preferably down to the second/millisecond, to facilitate searching by individual and transcribed text simultaneously. The system, or more specifically the server 100, maintains thousands of trained files containing the most common points on a human face. In accordance with an exemplary embodiment of the claimed invention, the server processor 110 extracts the eyes (all outer points and their angles), the mouth (all outer points and their angles), the nose (all outer points and their angles) and the x, y coordinates of these features from the time-aligned video frames and compares/matches the extracted features to the stored facial features (data points) of known individuals and/or entities in the facial database 523 and/or data warehouse 130. It is appreciated that the number of data points is highly dependent on the resolution of the file, limited by the number of pixels. These data points create a "fingerprint"-like overlay of an individual's face, at which point it is compared with the pre-analyzed face "fingerprints" already stored in a local or external database, e.g., the server database 130, the facial database 523 and/or the client storage 250. For certain applications, the client storage/database 250 may contain a limited set of pre-analyzed face fingerprints for faster processing. For a large-scale search, the server processor 110 returns a list of the 10 most probable candidates. For a small-scale search of a trained 1,000-person database, the search accuracy of the claimed invention can reach near 100%.
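By way of illustration only, the following is a minimal sketch of the face "fingerprint" comparison, assuming the facial data points (outer points of the eyes, mouth and nose) have already been extracted; the known_faces table and its coordinates are hypothetical stand-ins for the pre-analyzed fingerprints in the facial database 523:

```python
# Minimal sketch of comparing an extracted face "fingerprint" against stored
# fingerprints; landmark extraction itself is assumed to have been done already.
import math

def fingerprint_distance(a: list[tuple[float, float]], b: list[tuple[float, float]]) -> float:
    """Mean Euclidean distance between corresponding (x, y) landmark points."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def best_candidates(extracted, known_faces, top_n=10):
    """Return the top_n most probable identities, closest fingerprint first."""
    scored = [(fingerprint_distance(extracted, pts), name) for name, pts in known_faces.items()]
    return [name for _, name in sorted(scored)[:top_n]]

known_faces = {  # hypothetical pre-analyzed fingerprints, normalized coordinates
    "person_a": [(0.30, 0.40), (0.70, 0.40), (0.50, 0.55), (0.50, 0.75)],
    "person_b": [(0.28, 0.42), (0.72, 0.42), (0.50, 0.60), (0.50, 0.78)],
}
extracted = [(0.29, 0.41), (0.71, 0.41), (0.50, 0.56), (0.50, 0.76)]
print(best_candidates(extracted, known_faces))
```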

In accordance with an exemplary embodiment of the claimed invention, as shown in FIG. 4, the server processor 110 performs real-time or post-processed object recognition analysis of machine transcribed media at step 305. The server processor 110 performs object recognition processing or analysis on the stored time-aligned video frames to extract object recognition metadata, preferably time-aligned object recognition metadata, at step 530. The object recognition metadata extraction processing is further described in conjunction with FIG. 12, illustrating a real-time or post-processed object recognition analysis of machine transcribed media in accordance with an exemplary embodiment of the claimed invention. The video frame engine 160 stores the extracted video frame metadata, preferably the time-aligned video frames associated with the source media, in the database 130 at step 356. The server processor 110 extracts object data points by timed interval from the stored time-aligned video frames at step 531. The server processor 110 performs database lookups based on a dataset of predefined object data points stored in the object database 533 at step 532. It is appreciated that the dataset of predefined object data points can alternatively or additionally be stored in the data warehouse 130, and the database lookups can be performed against the data warehouse 130 or against a separate object database 533. The object database 533 or data warehouse 130 returns the matched object recognition metadata, preferably time-aligned object recognition metadata, to the server processor 110 if a match at the timed interval is found at step 534. The server processor 110 stores the time-aligned object recognition metadata in the data warehouse 130 at step 535 and proceeds to the next timed interval of the time-aligned video frames for processing. If the server processor 110 is unable to find a match for a given timed interval of the time-aligned video frames, then the server processor 110 skips the current timed interval and proceeds to the next timed interval of the time-aligned video frames for processing.

In accordance with an exemplary embodiment of the claimed invention, the server processor 110 or an object recognition server extracts objects from the transcribed video/audio and matches each of the extracted objects to known objects stored in the object database 533 and/or the data warehouse 130. The server processor 110 identifies/recognizes objects/places/things via an image recognition analysis. In accordance with an exemplary embodiment of the claimed invention, the server processor 110 compares the extracted objects/places/things against geometrical patterns stored in the object database 533 and/or the data warehouse 130. The server processor 110 also extracts and associates these matched objects/places/things back to the extracted transcribed text, preferably down to the second/millisecond, to facilitate searching by objects/places/things and transcribed text simultaneously. Examples of an object/place/thing include a dress, a purse, other clothing, a building, a statue, a landmark, a city, a country, a locale, a coffee mug, other common items and the like.

The server processor 110 performs object recognition in much the same way as the facial recognition. Instead of analyzing "facial" features, the server processor 110 analyzes the basic boundaries of an object. For example, the server processor 110 analyzes the outer points of the Eiffel Tower's construction, analyzes a photo pixel by pixel and compares it to a stored object "fingerprint" file to detect the object. The object "fingerprint" files are stored in the object database 533 and/or the data warehouse 130.
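By way of illustration only, the following is a minimal sketch of boundary-based object matching, assuming the OpenCV library; comparing the largest detected contour against stored reference contours with cv2.matchShapes stands in for the described comparison against stored object "fingerprint" files, and all file names are illustrative assumptions:

```python
# Minimal sketch of boundary-based object matching, assuming OpenCV is available.
import cv2

def largest_contour(image_path: str):
    """Return the largest outer contour in the image as a rough object boundary."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return max(contours, key=cv2.contourArea)

def match_object(frame_path: str, fingerprints: dict, max_score: float = 0.1):
    """Return the best-matching object label, or None if no fingerprint is close enough."""
    frame_contour = largest_contour(frame_path)
    best = min(
        ((cv2.matchShapes(frame_contour, fp, cv2.CONTOURS_MATCH_I1, 0.0), label)
         for label, fp in fingerprints.items()),
        default=(float("inf"), None),
    )
    return best[1] if best[0] <= max_score else None

# Hypothetical usage with a reference image and an extracted frame:
# fingerprints = {"eiffel_tower": largest_contour("eiffel_reference.png")}
# print(match_object("frame_00010.00.png", fingerprints))
```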

Once the various extraction processes have been executed on the time-aligned textual transcription, the time-aligned audio frames and/or the time-aligned video frames, the server processor 110 updates the data warehouse 130 with these new pieces of time-aligned metadata associated with the source media.

Returning to FIG. 3, the process by which the user can utilize and search the time-aligned extracted metadata associated with the source file will now be described. As noted herein, the source file can be printed non-digital content or audio/video/image media. A user, preferably an authorized user, logs on to the server 100 over the communications network 300. Preferably, the server 100 authenticates the user using any known verification methods, e.g., userid and password, etc., before providing access to the data warehouse 130. The client processor 210 of the client device 200 associated with the user transmits the data request or search query to the server 100 over the communications network 300 via the connection facility 260 at step 316. The server processor 110 receives the data request/search query from the user's client device 200 via the connection facility 140. It is appreciated that the originating source of the query can be an automated external server process, an automated internal server process, a one-time external request, a one-time internal request or other comparable process/request. In accordance with an exemplary embodiment of the claimed invention, the server 100 presents a graphical user interface (GUI), such as a web-based GUI or pre-compiled GUI, on the display 220 of the user's client device 200 for receiving and processing the data request or search query by the user at step 315. Alternatively, the server 100 can utilize an application programming interface (API), direct query or other comparable means to receive and process a data request from the user's client device 200. That is, once the search query is received from the user's client device 200, the server processor 110 converts the textual data (i.e., the data request or search query) into an acceptable format for a local or remote Application Programming Interface (API) request to the data warehouse 130 containing time-aligned metadata associated with the source media at step 313. The data warehouse 130 returns language analytics results of one or more of the following correlated with the normalized amplitude value: a) temporal aggregated natural language processing 309, such as sentiment, entity/topic analysis and socio-demographic or demographic information; b) temporal aggregated psychological analysis 310; c) temporal aggregated audio metadata analysis 311; and d) temporal aggregated visual metadata analysis 312. In accordance with an exemplary embodiment of the claimed invention, the server 100 can allow for programmatic, GUI or direct selective querying of the time-aligned textual transcription and metadata stored in the data warehouse 130 as a result of the various extraction processing and analysis on the source video/audio file. For example, this advantageously enables the claimed invention to extract metadata, e.g., textual sentiment, facial or object recognition, for a period of time preceding or succeeding an event associated with the highest sound (e.g., a riot) across a multitude of media sources.

In accordance with an exemplary embodiment of the claimed invention, the temporal aggregated natural language processing API server provides a numerical or textual representation of sentiment. That is, the sentiment is provided on a numerical scale, with a positive sentiment represented by a positive value, a negative sentiment represented by a negative value and a neutral sentiment being zero (0). These results are achieved by the server processor 110 using natural language processing analyses. Specifically, the server processor queries the data against positive/negative weighted words and phrases stored in a server database or the data warehouse 130.
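By way of illustration only, the following is a minimal sketch of the numerical sentiment scale, assuming small hypothetical positive/negative word-weight tables in place of the weighted words and phrases stored in the server database or data warehouse 130:

```python
# Minimal sketch of weighted-word sentiment scoring on a numerical scale;
# the POSITIVE/NEGATIVE tables are illustrative assumptions.
POSITIVE = {"great": 2, "win": 1, "excellent": 3}
NEGATIVE = {"terrible": -3, "lose": -1, "bad": -2}

def sentence_sentiment(sentence: str) -> int:
    """Sum weighted positive/negative word scores; zero (0) is neutral."""
    score = 0
    for word in (w.strip(".,!?").lower() for w in sentence.split()):
        score += POSITIVE.get(word, 0) + NEGATIVE.get(word, 0)
    return score

print(sentence_sentiment("an excellent win"))        # positive value
print(sentence_sentiment("a terrible, bad day"))     # negative value
```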

In accordance with an exemplary embodiment of the claimed invention, the server processor 110 or a server-based hardware component interacts directly with the data warehouse 130 to query and analyze the stored media of time-aligned metadata for natural language processed, sentiment, demographic and/or socio-demographic information at step 309. Preferably, the system utilizes a natural language processing API server to query and analyze the stored media. It is appreciated that after analysis of the source media, the server processor 110 updates the data warehouse 130 with the extracted information, such as the extracted time-aligned sentiment, natural language processed and demographic metadata.

In accordance with an exemplary embodiment of the claimed invention, the server processor 110 or a server-based hardware component interacts directly with the data warehouse 130 to query and analyze the stored media of time-aligned metadata for psychological information at step 310. Preferably, the system utilizes a psychological analysis API server to query and analyze the stored time-aligned psychological metadata. It is appreciated that after analysis of the source media, the server processor 110 updates the data warehouse 130 with the extracted information, such as the extracted time-aligned psychological metadata.

In accordance with an exemplary embodiment of the claimed invention, the temporal aggregated psychological analysis API server provides a numerical or textual representation of the psychological profile or model. That is, a variety of psychological indicators are returned indicating the psychological profile of individuals speaking in a segmented or entire transcribed text or transcript. The server processor 110 compares the word/phrase content appearing in the analyzed transcribed text against the stored weighted psychological data, e.g., the stored predefined word/psychological profile associations, in the psychological database 454 or the server database 130.

In accordance with an exemplary embodiment of the claimed invention, the server processor 110 or a server-based hardware component interacts directly with the data warehouse 130 to query and analyze the stored media of time-aligned metadata for audio information at step 311. Preferably, the system utilizes an audio metadata analysis API server to query and analyze the time-aligned audio metadata, such as the time-aligned amplitude metadata. It is appreciated that after analysis of the source media, the server processor 110 updates the data warehouse 130 with the extracted information, such as the extracted time-aligned amplitude metadata.

In accordance with an exemplary embodiment of the claimed invention, the server processor 110 or a server-based hardware component interacts directly with the data warehouse 130 to query and analyze the stored media of time-aligned metadata for visual information at step 312. Preferably, the system utilizes the visual metadata analysis API server to query and analyze the time-aligned visual metadata, such as the time-aligned OCR, facial recognition and object recognition metadata. It is appreciated that after analysis of the source media, the server processor 110 updates the data warehouse 130 with the extracted information, such as the extracted time-aligned OCR, facial recognition and object recognition metadata.

In accordance with an exemplary embodiment of the claimed invention, the system comprises an optional language translation API server for providing server-based machine translation of the returned data into a human spoken language selected by the user at step 314.

It is appreciated that any combination of data stored by the server processor 110 in performing the conversion, metadata extraction and analytical processing of untranscribed media can be searched. The following is a list of non-limiting exemplary searches: searching the combined transcribed data (a search via an internet appliance for "hello how are you" in a previously untranscribed audio/video stream); searching combined transcribed data for sentiment; searching combined transcribed data for psychological traits; searching combined transcribed data for entities/concepts/themes; searching the combined transcribed data for individuals (politicians, celebrities) in combination with transcribed text via facial recognition; and any combination of the above searches.

Currently, the majority of video/audio streaming services allow for search solely by title, description and genre of the file. With the claimed invention, a variety of unique search methods combining extracted structured and unstructured textual, aural and visual metadata from media files is now possible. The following are non-limiting exemplary searches after the source media files have been transcribed in accordance with the claimed invention:

-   search transcribed media for a specific textual phrase, only when a specific person appears within 10 seconds of the inputted phrase, e.g., "Home Run," combined with facial recognition of a specific named baseball player (e.g., Derek Jeter), as sketched in the example following this list;
-   search transcribed media for the term "Home Run," when uttered in a portion of the file where sentiment is negative;
-   search transcribed media for the term "Home Run," ordered by aural amplitude, which would allow a user to find the phrase he/she is searching for during a scene with the most noise/action;
-   search transcribed media for the term "Home Run" when more than 5 faces are detected on screen at once, which could reveal a celebration on the field. A specific example would be the 1986 World Series, when Tim Teufel hit a walk-off home run and 10+ players celebrated at home plate;
-   search transcribed media (an audio-only file) for the phrase "Home Run" along with "New York Mets" when the content is editorial, where the server processor 110 applies psychological filters, e.g., "thinking" vs. "feeling," to identify emotional/editorial content vs. academic/thinking content; and
-   search transcribed media for a specific building, for example the "Empire State Building," when the phrase "was built" was uttered in the file, which would allow for a novel search to find construction videos of the Empire State Building.
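By way of illustration only, the following is a minimal sketch of the first combined query above (a phrase within 10 seconds of a recognized face), assuming the time-aligned metadata has been flattened into two illustrative tables; the schema, table names and sample rows are assumptions rather than the actual data warehouse 130 layout:

```python
# Minimal sketch of a combined transcript/face query over time-aligned metadata,
# using an in-memory SQLite database with an assumed, illustrative schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE transcript_terms (media_id TEXT, t_seconds REAL, phrase TEXT);
    CREATE TABLE face_detections (media_id TEXT, t_seconds REAL, person TEXT);
    INSERT INTO transcript_terms VALUES ('game42', 305.0, 'and that is a Home Run');
    INSERT INTO face_detections  VALUES ('game42', 309.5, 'Derek Jeter');
    """
)
rows = conn.execute(
    """
    SELECT t.media_id, t.t_seconds, t.phrase
    FROM transcript_terms AS t
    JOIN face_detections AS f
      ON f.media_id = t.media_id
     AND ABS(f.t_seconds - t.t_seconds) <= 10
    WHERE t.phrase LIKE '%Home Run%' AND f.person = 'Derek Jeter'
    ORDER BY t.t_seconds
    """
).fetchall()
print(rows)  # matching (media_id, timestamp, phrase) tuples
```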

In accordance with an exemplary embodiment of the claimed invention, when the transcribed media is searched for an entity, the system not only extracts time-aligned natural language processed metadata related to the entity being searched, but the time-aligned natural language processed metadata is mapped to a single, normalized, universal amplitude scale, thereby providing a normalized amplitude level at the time of utterance. That is, in the context of search and advertising, the claimed single, normalized, universal amplitude scale enables the claimed system to determine the weight of the entity being mentioned/viewed in the transcribed media.
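By way of illustration only, the following is a minimal sketch of weighting entity mentions by the normalized amplitude at the time of utterance; the per-second amplitude values and mention list are illustrative stand-ins for the stored time-aligned metadata:

```python
# Minimal sketch: join entity mentions to the normalized amplitude value at the
# utterance time and rank them; all values below are illustrative assumptions.
amplitude_by_second = {10: 35, 11: 40, 12: 90, 13: 95}   # normalized 0-100 scale
entity_mentions = [("WalMart", 11), ("Home Run", 13)]     # (entity, t_seconds)

weighted = [(entity, t, amplitude_by_second.get(t, 0)) for entity, t in entity_mentions]
# Entities uttered during louder moments receive a higher weight for search/advertising.
weighted.sort(key=lambda row: row[2], reverse=True)
print(weighted)  # [('Home Run', 13, 95), ('WalMart', 11, 40)]
```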

In accordance with an exemplary embodiment of the claimed invention, the system can also be utilized to analyze transcribed media for demographic information, based upon database-stored text corpuses broken down by taxonomy. For example, the server processor 110 analyzes the transcribed media file in its entirety and then programmatically compares the transcription to the stored corpus associated with each taxonomy. For example, the system can rank politics the highest versus all other topical taxonomies, and the system can associate a gender/age range with the political content. This advantageously permits the server processor 110 to utilize the time-aligned metadata for targeted advertising. The server processor 110 can apply these extracted demographics with revealed celebrities/public figures to assist in the development of micro-targeted advertisements during streaming audio/video.
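By way of illustration only, the following is a minimal sketch of ranking a transcript against stored taxonomy corpuses and reading off an associated demographic estimate; the corpuses, the cosine-similarity measure and the demographic associations are illustrative assumptions:

```python
# Minimal sketch of taxonomy ranking by word overlap with stored corpuses;
# TAXONOMY_CORPORA and TAXONOMY_DEMOGRAPHICS are illustrative assumptions.
from collections import Counter
import math

TAXONOMY_CORPORA = {
    "politics": "senate vote election policy congress campaign",
    "sports": "inning pitcher home run stadium playoff",
}
TAXONOMY_DEMOGRAPHICS = {"politics": "35-64", "sports": "18-49"}

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    return dot / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())) or 1.0)

def rank_taxonomies(transcript: str):
    words = Counter(transcript.lower().split())
    scored = {t: cosine(words, Counter(c.split())) for t, c in TAXONOMY_CORPORA.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

ranked = rank_taxonomies("the senate vote on the campaign finance policy")
print(ranked[0][0], TAXONOMY_DEMOGRAPHICS[ranked[0][0]])  # highest-ranked taxonomy
```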

In accordance with an exemplary embodiment of the claimed invention, vast opportunities are available with the claimed system's ability to search transcribed video files via optical character recognition of video frames. For example, a user can search for "WalMart" and receive not only spoken words, but also appearances of the WalMart logo on the display 220 of her client device 200, extracted via optical character recognition on a still frame of the video by the server processor 110.

In accordance with an exemplary embodiment of the claimed invention, the claimed system extracts at least one or more of the following metadata and makes them queryable using standard database queries: audio amplitude, textual sentiment, natural language processing, demographic estimation, psychological profile, optical character recognition, facial recognition and object recognition. The user can search any combination of the aforementioned metadata. For example, the user can search news clips of a specific genre for large crowds rioting or protesting by initiating a database query with the following constraints: search only the first three minutes of media with more than 50 faces detected and with an audio amplitude scale of 90+ for over ten seconds. In another example, the user can search the media files for an utterance of the word "stop" by initiating a database query for the word "stop" with an amplitude scale of 100. In yet another example, instead of relying on the ineffective consumer survey at the end of a service call, which is skipped by the vast majority of consumers, the service provider can utilize the claimed system to determine the actual customer experience with its customer representative by initiating a database query for negative textual sentiment with an audio amplitude scale of 70+ over multiple consecutive audio/video frames, e.g., 5 seconds of someone complaining loudly.
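By way of illustration only, the following is a minimal sketch of the customer-call example above, scanning per-second time-aligned metadata for a run of at least five consecutive seconds of negative sentiment at a normalized amplitude of 70 or higher; the row format is an illustrative assumption:

```python
# Minimal sketch: find maximal runs of consecutive seconds where sentiment is
# negative and the normalized amplitude is at least min_amplitude.
def find_loud_negative_runs(rows, min_run_s=5, min_amplitude=70):
    """rows: (t_seconds, sentiment, amplitude) per second, time-ordered.
    Returns (start, end) of each maximal run satisfying both constraints."""
    runs, start, end = [], None, None
    for t, sentiment, amplitude in rows:
        if sentiment < 0 and amplitude >= min_amplitude:
            start = t if start is None else start
            end = t
        else:
            if start is not None and end - start + 1 >= min_run_s:
                runs.append((start, end))
            start = None
    if start is not None and end - start + 1 >= min_run_s:
        runs.append((start, end))
    return runs

# Hypothetical per-second metadata: loud complaining from second 10 through 16.
rows = [(t, -1 if 10 <= t < 17 else 1, 85 if 10 <= t < 17 else 40) for t in range(30)]
print(find_loud_negative_runs(rows))  # [(10, 16)]
```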

The accompanying description and drawings only illustrate several embodiments of a system, methods and interfaces for metadata identification, searching and matching; however, other forms and embodiments are possible. Accordingly, the description and drawings are not intended to be limiting in that regard. Thus, although the description above and accompanying drawings contain much specificity, the details provided should not be construed as limiting the scope of the embodiments but merely as providing illustrations of some of the presently preferred embodiments. The drawings and the description are not to be taken as restrictive on the scope of the embodiments and are understood as broad and general teachings in accordance with the present invention. While the present embodiments of the invention have been described using specific terms, such description is for present illustrative purposes only, and it is to be understood that modifications and variations to such embodiments may be practiced by those of ordinary skill in the art without departing from the spirit and scope of the invention.

1-30. (canceled)
31. A computer based method for transcribing and extracting metadata from a non-transcribed source media, comprising the steps of: extracting an audio stream from the non-transcribed source media by a processor-based server; extracting time-aligned audio frames from the audio stream by an audio frame engine; processing the time-aligned audio frames to extract audio amplitudes by a timed interval, to measure aural amplitudes of the extracted audio amplitudes and assign a numerical value to each extracted audio amplitude to provide time-aligned aural amplitudes by a server processor; generating an audio histogram of the audio stream by the server processor; normalizing the audio stream to a single, normalized, universal amplitude scale by determining a loudest frame with a loudest sound and a softest frame with a softest sound within the audio stream by the server processor; assigning a normalized minimum amplitude value to the softest frame of the audio stream and a normalized maximum amplitude value to the loudest frame of the audio stream; comparing each frame of the audio stream to the loudest frame and the softest frame by utilizing the audio histogram and assigning a normalized amplitude value between the normalized minimum amplitude value and the normalized maximum amplitude value to said each frame in accordance with a result of the comparison; and storing the time-aligned audio frames, the time-aligned aural amplitudes and the normalized amplitude value of each frame of the audio stream in a database.
32. The computer based method of claim 31, further comprising the steps of: speech recognition processing of the audio stream to transcribe the audio stream into a time-aligned textual transcription by a speech recognition engine to provide a time-aligned machine transcribed media; processing the time-aligned machine transcribed media by the server processor to extract time-aligned textual metadata associated with the source media; and storing the time-aligned machine transcribed media and the time-aligned textual metadata in the database.
33. The computer based method of claim 31, further comprising the steps of: extracting a video stream from the source media by a video frame engine; extracting time-aligned video frames from the video stream by the video frame engine; storing the time-aligned video frames in the database; and processing the time-aligned video frames by the server processor to extract time-aligned visual metadata associated with the source media.
34. The computer based method of claim 33, wherein the step of processing the time-aligned video frames further comprises the steps of: performing an optical character recognition (OCR) analysis on the time-aligned video frames by the server processor to extract time-aligned OCR metadata; extracting text from graphics by a timed interval from the time-aligned video frames; performing database lookups based on a dataset of predefined recognized fonts, letters and languages stored in the database; and receiving one or more matched time-aligned OCR metadata from the database by the server processor.
35. The computer based method of claim 33, wherein the step of processing the time-aligned video frames further comprises the steps of: performing a facial recognition analysis on the time-aligned video frames by the server processor to extract time-aligned facial recognition metadata; extracting facial data points by a timed interval from the time-aligned video frames; performing database lookups based on a dataset of predefined facial data points for individuals stored in the database; and receiving one or more matched time-aligned facial metadata from the database by the server processor.
36. The computer based method of claim 33, wherein the step of processing the time-aligned video frames further comprises the steps of: performing an object recognition analysis on the time-aligned video frames by the server processor to extract time-aligned object recognition metadata; extracting object data points by a timed interval from the time-aligned video frames; performing database lookups based on a dataset of predefined object data points for a plurality of objects stored in the database; and receiving one or more matched time-aligned object metadata from the database by the server processor.
37. The computer based method of claim 33, wherein the step of processing the time-aligned video frames further comprises the steps of: performing an optical character recognition (OCR) analysis on the time-aligned video frames by the server processor to extract time-aligned OCR metadata; performing a facial recognition analysis on the time-aligned video frames by the server processor to extract time-aligned facial recognition metadata; and performing an object recognition analysis on the time-aligned video frames by the server processor to extract time-aligned object recognition metadata.
38. A non-transitory computer readable medium comprising computer executable code for transcribing and extracting metadata from a non-transcribed source media, the code comprising instructions for: extracting an audio stream from the non-transcribed source media by a processor-based server; extracting time-aligned audio frames from the audio stream by an audio frame engine; processing the time-aligned audio frames to extract audio amplitudes by a timed interval, to measure aural amplitudes of the extracted audio amplitudes and assign a numerical value to each extracted audio amplitude to provide time-aligned aural amplitudes by a server processor; generating an audio histogram of the audio stream by the server processor; normalizing the audio stream to a single, normalized, universal amplitude scale by determining a loudest frame with a loudest sound and a softest frame with a softest sound within the audio stream by the server processor; assigning a normalized minimum amplitude value to the softest frame of the audio stream and a normalized maximum amplitude value to the loudest frame of the audio stream; comparing each frame of the audio stream to the loudest frame and the softest frame by utilizing the audio histogram and assigning a normalized amplitude value between the normalized minimum amplitude value and the normalized maximum amplitude value to said each frame in accordance with a result of the comparison; and storing the time-aligned audio frames, the time-aligned aural amplitudes and the normalized amplitude value of each frame of the audio stream in a database.
39. The computer readable medium of claim 38, wherein said computer executable code further comprises instructions for: speech recognition processing of the audio stream by a speech recognition engine to transcribe the audio stream into a time-aligned textual transcription to provide a time-aligned machine transcribed media; processing the time-aligned machine transcribed media by the server processor to extract time-aligned textual metadata associated with the source media; and storing the time-aligned machine transcribed media and the time-aligned textual metadata in the database.
40. The computer readable medium of claim 38, wherein said computer executable code further comprises instructions for: extracting a video stream from the source media by a video frame engine of a processor-based server; extracting time-aligned video frames from the video stream by the video frame engine; storing the time-aligned video frames in the database; and processing the time-aligned video frames by a server processor to extract time-aligned visual metadata associated with the source media.
41. The computer readable medium of claim 40, wherein said computer executable code further comprises instructions for: performing an optical character recognition (OCR) analysis on the time-aligned video frames by the server processor to extract time-aligned OCR metadata; extracting text from graphics by a timed interval from the time-aligned video frames; performing database lookups based on a dataset of predefined recognized fonts, letters and languages stored in the database; and receiving one or more matched time-aligned OCR metadata from the database by the server processor.

42. The computer readable medium of claim 40, wherein said computer executable code further comprises instructions for: performing a facial recognition analysis on the time-aligned video frames by the server processor to extract time-aligned facial recognition metadata; extracting facial data points by a timed interval from the time-aligned video frames; performing database lookups based on a dataset of predefined facial data points for individuals stored in the database; and receiving one or more matched time-aligned facial metadata from the database by the server processor.
43. The computer readable medium of claim 40, wherein said computer executable code further comprises instructions for: performing an object recognition analysis on the time-aligned video frames by the server processor to extract time-aligned object recognition metadata; extracting object data points by a timed interval from the time-aligned video frames; performing database lookups based on a dataset of predefined object data points for a plurality of objects stored in the database; and receiving one or more matched time-aligned object metadata from the database by the server processor.
44. A system for transcribing and extracting metadata from a non-transcribed source media, comprising: a processor-based server connected to a communications system to receive and extract an audio stream from the source media, the server comprising: an audio frame engine to extract time-aligned audio frames from the audio stream; a server processor to: process time-aligned audio frames to extract audio amplitudes by a timed interval, measure aural amplitude of the extracted audio amplitudes; assign a numerical value to each extracted audio amplitude to provide time-aligned aural amplitudes; generate an audio histogram of the audio stream; normalize the audio stream to a single, normalized, universal amplitude scale by determining a loudest frame with a loudest sound and a softest frame with a softest sound within the audio stream; assign a normalized minimum amplitude value to the softest frame of the audio stream and a normalized maximum amplitude value to the loudest frame of the audio stream; compare each frame of the audio stream to the loudest frame and the softest frame by utilizing the audio histogram and assign a normalized amplitude value between the normalized minimum amplitude value and the normalized maximum amplitude value to said each frame in accordance with a result of the comparison; and a database to store the time-aligned audio frames and the time-aligned aural amplitudes, and to store the normalized amplitude value of each frame of the audio stream.
45. The system of claim 44, wherein the server further comprises a speech recognition engine to process the audio stream to transcribe the audio stream into a time-aligned textual transcription to provide a time-aligned machine transcribed media; wherein the server processor is configured to process the time-aligned machine transcribed media to extract time-aligned textual metadata associated with the non-transcribed source media; and wherein the database stores the time-aligned machine transcribed media and the time-aligned textual metadata associated with the non-transcribed source media.
46. The system of claim 44, wherein the server comprises a video frame engine for extracting a video stream from the source media and extracting time-aligned video frames from the video stream; wherein the server processor processes the time-aligned video frames to extract time-aligned visual metadata associated with the source media; and wherein the database stores the time-aligned video frames.
47. The system of claim 46, wherein the server processor performs an optical character recognition (OCR) analysis on the time-aligned video frames to extract time-aligned OCR metadata by: extracting text from graphics by a timed interval from the time-aligned video frames; performing database lookups based on a dataset of predefined recognized fonts, letters and languages stored in the database; and receiving one or more matched time-aligned OCR metadata from the database.
48. The system of claim 46, wherein the server processor performs a facial recognition analysis on the time-aligned video frames to extract time-aligned facial recognition metadata by: extracting facial data points by a timed interval from the time-aligned video frames; performing database lookups based on a dataset of predefined facial data points for individuals stored in the database; and receiving one or more matched time-aligned facial metadata from the database.
49. The system of claim 46, wherein the server processor performs an object recognition analysis on the time-aligned video frames to extract time-aligned object recognition metadata by: extracting object data points by a timed interval from the time-aligned video frames; performing database lookups based on a dataset of predefined object data points for a plurality of objects stored in the database; and receiving one or more matched time-aligned object metadata from the database.